Loading R packages

This workflow is adapted from: https://bioconductor.org/packages/release/workflows/vignettes/GeoMxWorkflows/inst/doc/GeomxTools_RNA-NGS_Analysis.html#71_Within_Slide_Analysis:_Glomeruli_vs_Tubules

library(knitr)
knitr::opts_chunk$set(dpi=300, warning = FALSE, message = FALSE) # suppress warnings
library(SpatialDecon)
library(GeomxTools)
library(NanoStringNCTools)
library(GeoMxWorkflows)

library(ggplot2)
library(Seurat)
library(patchwork)
library(pheatmap)
library(dplyr)

Loading data

The key data files are:

  1. DCCs files - expression count data and sequencing quality metadata
  2. PKCs file(s) - probe assay metadata describing the gene targets present in the data, PKC files can be found here
  3. Annotation file - useful tissue information, including the type of segment profiled (ex: glomerulus vs. tubule), segment area/nuclei count, and other tissue characteristics (ex: diseased vs. healthy). If working with a new dataset, use the lab worksheet from the GeoMx instrument study readout package, as the annotation order of NTCs is important to ensure proper processing of files.
datadir <- file.path("/Users/helenhuang/Documents/1st\ Year\ PhD/Hoffmann\ Lab/P01\ Liver\ IRI\ project/GSE217936_Liver")
# datadir <- file.path("/Users/helenhuang/Documents/1st\ Year\ PhD/Hoffmann\ Lab/P01\ Liver\ IRI\ project/Example/Kidney_Dataset")

# automatically list files in each directory for use
DCCFiles <- dir(file.path(datadir, "dccs"), pattern = ".dcc$",
                full.names = TRUE, recursive = TRUE)
PKCFiles <- dir(file.path(datadir, "pkcs"), pattern = ".pkc$",
                                full.names = TRUE, recursive = TRUE)
AnnotationFile <- dir(file.path(datadir, "annotation"), pattern = ".xlsx$",
      full.names = TRUE, recursive = FALSE)

Data <- readNanoStringGeoMxSet(dccFiles = DCCFiles,
                         pkcFiles = PKCFiles,
                         phenoDataFile = AnnotationFile,
                         phenoDataSheet = "Template2",
                         phenoDataDccColName = "Sample_ID",
                         protocolDataColNames = c("aoi", "roi"),
                         experimentDataColNames = c("panel"))

# access the PKC files to ensure that the expected PKCs have been loaded for this study.
pkcs <- annotation(Data)
modules <- gsub(".pkc", "", pkcs)
kable(data.frame(PKCs = pkcs, modules = modules))
PKCs modules
Mm_R_NGS_WTA_v1.0.pkc Mm_R_NGS_WTA_v1.0

QC & Pre-processing

Before we begin, we will shift any expression counts with a value of 0 to 1 to enable in downstream transformations.

# Shift counts to one
Data <- shiftCountsOne(Data, useDALogic = TRUE)

First, we select the QC parameter cutoffs, against which our ROI/AOI segments will be tested and flagged appropriately. The default QC values recommended in the brackets are advised when surveying a new dataset for the first time.

# Default QC cutoffs are commented in () adjacent to the respective parameters
# study-specific values were selected after visualizing the QC results2 in more
# detail below
QC_params <-
  list(minSegmentReads = 1000, # Minimum number of reads (1000)
       percentTrimmed = 80,    # Minimum % of reads trimmed (80%)
       percentStitched = 80,   # Minimum % of reads stitched (80%)
       percentAligned = 75,    # Minimum % of reads aligned (80%)
       percentSaturation = 50, # Minimum sequencing saturation (50%)
       minNegativeCount = 1,   # Minimum negative control counts (10)
       maxNTCCount = 33773,     # Maximum counts observed in NTC well (1000)
       minNuclei = 100,         # Minimum # of nuclei estimated (100)
       minArea = 5000)         # Minimum segment area (5000)
Data <-
  setSegmentQCFlags(Data, 
                    qcCutoffs = QC_params)        

# Collate QC results2
QCresults2 <- protocolData(Data)[["QCFlags"]]
flag_columns <- colnames(QCresults2)
QC_Summary <- data.frame(Pass = colSums(!QCresults2[, flag_columns]),
                         Warning = colSums(QCresults2[, flag_columns]))
QCresults2$QCStatus <- apply(QCresults2, 1L, function(x) {
  ifelse(sum(x) == 0L, "PASS", "WARNING")
})
QC_Summary["TOTAL FLAGS", ] <-
  c(sum(QCresults2[, "QCStatus"] == "PASS"),
    sum(QCresults2[, "QCStatus"] == "WARNING"))

Before excluding any low-performing ROI/AOI segments, we visualize the distributions of the data for the different QC parameters. Note that the “Select Segment QC” and “Visualize Segment QC” sections are performed in parallel to fully understand low-performing segments for a given study. Iteration may follow to select the study-specific QC cutoffs.

For QC visualization, we write a quick function to draw histograms of our data.

col_by <- "segment"

# Graphical summaries of QC statistics plot function
QC_histogram <- function(assay_data = NULL,
                         annotation = NULL,
                         fill_by = NULL,
                         thr = NULL,
                         scale_trans = NULL) {
  plt <- ggplot(assay_data,
                aes_string(x = paste0("unlist(`", annotation, "`)"),
                           fill = fill_by)) +
    geom_histogram(bins = 50) +
    geom_vline(xintercept = thr, lty = "dashed", color = "black") +
    theme_bw() + guides(fill = "none") +
    facet_wrap(as.formula(paste("~", fill_by)), nrow = 4) +
    labs(x = annotation, y = "Segments, #", title = annotation)
  if(!is.null(scale_trans)) {
    plt <- plt +
      scale_x_continuous(trans = scale_trans)
  }
  plt
}

Now we explore each of the QC metrics for the segments.

QC_histogram(sData(Data), "Trimmed (%)", col_by, 80)

QC_histogram(sData(Data), "Stitched (%)", col_by, 80)

QC_histogram(sData(Data), "Saturated (%)", col_by, 50) +
  labs(title = "Sequencing Saturation (%)",
       x = "Sequencing Saturation (%)")

QC_histogram(sData(Data), "area", col_by, 9000, scale_trans = "log10")

QC_histogram(sData(Data), "nuclei", col_by, 100)

QC_histogram(sData(Data), "NTC", col_by, 9000)

# calculate the negative geometric means for each module
negativeGeoMeans <- 
  esBy(negativeControlSubset(Data), 
       GROUP = "Module", 
       FUN = function(x) { 
         assayDataApply(x, MARGIN = 2, FUN = ngeoMean, elt = "exprs") 
       }) 
protocolData(Data)[["NegGeoMean"]] <- negativeGeoMeans

# explicitly copy the Negative geoMeans from sData to pData
negCols <- paste0("NegGeoMean_", modules)
pData(Data)[, negCols] <- sData(Data)[["NegGeoMean"]]
for(ann in negCols) {
  plt <- QC_histogram(pData(Data), ann, col_by, 1, scale_trans = "log10")
  print(plt)
}

# detatch neg_geomean columns ahead of aggregateCounts call
pData(Data) <- pData(Data)[, !colnames(pData(Data)) %in% negCols]

Finally we plot all of the QC Summary information in a table.

kable(QC_Summary, caption = "QC Summary Table for each Segment")
QC Summary Table for each Segment
Pass Warning
LowReads 141 0
LowTrimmed 141 0
LowStitched 141 0
LowAligned 132 9
LowSaturation 139 2
LowNegatives 141 0
HighNTC 141 0
LowNuclei 141 0
LowArea 141 0
TOTAL FLAGS 130 11

As the final step in Segment QC, we remove flagged segments that do not meet our QC cutoffs.

Data <- Data[, QCresults2$QCStatus == "PASS"] # Subsetting our dataset has removed samples which did not pass QC
dim(Data)
## Features  Samples 
##    20175      130

Probe (gene-level) QC

Before we summarize our data into gene-level count data, we will remove low-performing probes. In short, this QC is an outlier removal process, whereby probes are either removed entirely from the study (global) or from specific segments (local). The QC applies to gene targets for which there are multiple distinct probes representing the count for a gene per segment. In WTA (Whole Transcriptome Atlas) data, one specific probe exists per target gene; thus, Probe QC does not apply to the endogenous genes in the panel. Rather, it is performed on the negative control probes; there are multiple probes representing our negative controls, which do not target any sequence in the genome. These probes enable calculation of the background per segment and will be important for determining gene detection downstream.

After Probe QC, there will always remain at least one probe representing every gene target. In other words, Probe QC never removes genes from your data.

A probe is removed globally from the dataset if either of the following is true: * the geometric mean of that probe’s counts from all segments divided by the geometric mean of all probe counts representing the target from all segments is less than 0.1 * the probe is an outlier according to the Grubb’s test in at least 20% of the segments A probe is removed locally (from a given segment) if the probe is an outlier according to the Grubb’s test in that segment.

# Generally keep the qcCutoffs parameters unchanged. Set removeLocalOutliers to 
# FALSE if you do not want to remove local outliers
Data <- setBioProbeQCFlags(Data, 
                               qcCutoffs = list(minProbeRatio = 0.1,
                                                percentFailGrubbs = 20), 
                               removeLocalOutliers = TRUE)

ProbeQCresults2 <- fData(Data)[["QCFlags"]]

# Define QC table for Probe QC
qc_df <- data.frame(Passed = sum(rowSums(ProbeQCresults2[, -1]) == 0),
                    Global = sum(ProbeQCresults2$GlobalGrubbsOutlier),
                    Local = sum(rowSums(ProbeQCresults2[, -2:-1]) > 0
                                & !ProbeQCresults2$GlobalGrubbsOutlier))

Exclude Outlier Probes

#Subset object to exclude all that did not pass Ratio & Global testing
ProbeQCPassed <- 
  subset(Data, 
         fData(Data)[["QCFlags"]][,c("LowProbeRatio")] == FALSE &
           fData(Data)[["QCFlags"]][,c("GlobalGrubbsOutlier")] == FALSE)
dim(ProbeQCPassed)
## Features  Samples 
##    20175      130
Data <- ProbeQCPassed 

Create Gene-level Count Data

With our Probe QC steps complete, we will generate a gene-level count matrix. The count for any gene with multiple probes per segment is calculated as the geometric mean of those probes.

# Check how many unique targets the object has
length(unique(featureData(Data)[["TargetName"]]))
## [1] 19963
# collapse to targets
target_Data <- aggregateCounts(Data)
dim(target_Data)
## Features  Samples 
##    19963      130
exprs(target_Data)[1:5, 1:2]
##               DSP-1001660005669-F-A02.dcc DSP-1001660005669-F-A03.dcc
## 0610009B22Rik                          17                          21
## Sanbr                                   4                           8
## 0610010K14Rik                          11                          22
## 0610012G03Rik                          21                          16
## 0610030E20Rik                          12                          12

Filtering based on LOQ

In addition to Segment and Probe QC, we also determine the limit of quantification (LOQ) per segment. The LOQ is calculated based on the distribution of negative control probes and is intended to approximate the quantifiable limit of gene expression per segment. Please note that this process is more stable in larger segments. Likewise, the LOQ may not be as accurately reflective of true signal detection rates in segments with low negative probe counts (ex: <2).

We typically use 2 geometric standard deviations (n=2) above the geometric mean as the LOQ, which is reasonable for most studies. We also recommend that a minimum LOQ of 2 be used if the LOQ calculated in a segment is below this threshold.

# Define LOQ SD threshold and minimum value
cutoff <- 2
minLOQ <- 2

# Calculate LOQ per module tested
LOQ <- data.frame(row.names = colnames(target_Data))
for(module in modules) {
  vars <- paste0(c("NegGeoMean_", "NegGeoSD_"),
                 module)
  if(all(vars[1:2] %in% colnames(pData(target_Data)))) {
    LOQ[, module] <-
      pmax(minLOQ,
           pData(target_Data)[, vars[1]] * 
             pData(target_Data)[, vars[2]] ^ cutoff)
  }
}
pData(target_Data)$LOQ <- LOQ

After determining the limit of quantification (LOQ) per segment, we recommend filtering out either segments and/or genes with abnormally low signal. Filtering is an important step to focus on the true biological data of interest.

We determine the number of genes detected in each segment across the dataset.

LOQ_Mat <- c()
for(module in modules) {
  ind <- fData(target_Data)$Module == module
  Mat_i <- t(esApply(target_Data[ind, ], MARGIN = 1,
                     FUN = function(x) {
                       x > LOQ[, module]
                     }))
  LOQ_Mat <- rbind(LOQ_Mat, Mat_i)
}
# ensure ordering since this is stored outside of the geomxSet
LOQ_Mat <- LOQ_Mat[fData(target_Data)$TargetName, ]

We first filter out segments with exceptionally low signal. These segments will have a small fraction of panel genes detected above the LOQ relative to the other segments in the study. Let’s visualize the distribution of segments with respect to their % genes detected:

# Save detection rate information to pheno data
pData(target_Data)$GenesDetected <- 
  colSums(LOQ_Mat, na.rm = TRUE)
pData(target_Data)$GeneDetectionRate <-
  pData(target_Data)$GenesDetected / nrow(target_Data)

# Determine detection thresholds: 1%, 5%, 10%, 15%, >15%
pData(target_Data)$DetectionThreshold <- 
  cut(pData(target_Data)$GeneDetectionRate,
      breaks = c(0, 0.01, 0.05, 0.1, 0.15, 1),
      labels = c("<1%", "1-5%", "5-10%", "10-15%", ">15%"))
pData(target_Data)$DetectionThreshold <- factor(pData(target_Data)$DetectionThreshold, levels = c("<1%", "1-5%", "5-10%", "10-15%", ">15%"))

# stacked bar plot of different cut points (1%, 5%, 10%, 15%)
ggplot(pData(target_Data),
       aes(x = DetectionThreshold)) +
  geom_bar(aes(fill = region), width=0.9) +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
  theme_bw() +
  scale_x_discrete(drop=FALSE) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(x = "Gene Detection Rate",
       y = "Segments, #",
       fill = "Segment Type")

# cut percent genes detected at 1, 5, 10, 15
kable(table(pData(target_Data)$DetectionThreshold,
            pData(target_Data)$class))
celastrol I/R sham
<1% 0 0 0
1-5% 0 0 0
5-10% 0 1 0
10-15% 0 0 0
>15% 45 37 47

Generally, 5-10% detection is a reasonable segment filtering threshold. However, based on the experimental design (e.g. segment types, size, nuclei) and tissue characteristics (e.g. type, age), these guidelines may require adjustment.

target_Data <-
  target_Data[, pData(target_Data)$GeneDetectionRate >= .1]

dim(target_Data)
## Features  Samples 
##    19963      129

Next, we determine the detection rate for genes across the study. To illustrate this idea, we create a small gene list (goi) to review.

library(scales) # for percent

# Calculate detection rate:
LOQ_Mat <- LOQ_Mat[, colnames(target_Data)]
fData(target_Data)$DetectedSegments <- rowSums(LOQ_Mat, na.rm = TRUE)
fData(target_Data)$DetectionRate <-
  fData(target_Data)$DetectedSegments / nrow(pData(target_Data))

# Gene of interest detection table
goi <- c("Epas1", "Hif1a", "Hnrnpa1", "Srsf3", "Srsf6", "Ptbp1", "Elavl1")
# kupffer_gene <- c("Ceacam1", "Tmsb4x", "Apoe", "Fth1", "B2m", "Ctsb", "Cd74", "Cd5l", "Lyz2", "Clec4f", "Calm1", "Nfkbia")
goi_df <- data.frame(
  Gene = goi,
  Number = fData(target_Data)[goi, "DetectedSegments"],
  DetectionRate = percent(fData(target_Data)[goi, "DetectionRate"]))
goi_df
##      Gene Number DetectionRate
## 1   Epas1    129          100%
## 2   Hif1a    129          100%
## 3 Hnrnpa1    129          100%
## 4   Srsf3    129          100%
## 5   Srsf6    129          100%
## 6   Ptbp1    129          100%
## 7  Elavl1    129          100%

We will graph the total number of genes detected in different percentages of segments. Based on the visualization below, we can better understand global gene detection in our study and select how many low detected genes to filter out of the dataset. Gene filtering increases performance of downstream statistical tests and improves interpretation of true biological signal.

# Plot detection rate:
plot_detect <- data.frame(Freq = c(1, 5, 10, 20, 30, 50, 70, 100))
plot_detect$Number <-
  unlist(lapply(c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0),
                function(x) {sum(fData(target_Data)$DetectionRate >= x)}))
plot_detect$Rate <- plot_detect$Number / nrow(fData(target_Data))
rownames(plot_detect) <- plot_detect$Freq

ggplot(plot_detect, aes(x = as.factor(Freq), y = Rate, fill = Rate)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = formatC(Number, format = "d", big.mark = ",")),
            vjust = 1.6, color = "black", size = 4) +
  scale_fill_gradient2(low = "orange2", mid = "lightblue",
                       high = "dodgerblue3", midpoint = 0.65,
                       limits = c(0,1),
                       labels = scales::percent) +
  theme_bw() +
  scale_y_continuous(labels = scales::percent, limits = c(0,1),
                     expand = expansion(mult = c(0, 0))) +
  labs(x = "% of Segments",
       y = "Genes Detected, % of Panel > LOQ")

We typically set a % Segment cutoff ranging from 5-20% based on the biological diversity of our dataset. For this study, we will select 10% as our cutoff. In other words, we will focus on the genes detected in at least 10% of our segments; we filter out the remainder of the targets.

Note: if we know that a key gene is represented in only a small number of segments (<10%) due to biological diversity, we may select a different cutoff or keep the target gene by manually selecting it for inclusion in the data object.

# Subset to target genes detected in at least 10% of the samples.
#   Also manually include the negative control probe, for downstream use
negativeProbefData <- subset(fData(target_Data), CodeClass == "Negative")
neg_probes <- unique(negativeProbefData$TargetName)
target_Data <- 
  target_Data[fData(target_Data)$DetectionRate >= 0.1 |
                    fData(target_Data)$TargetName %in% neg_probes, ]
dim(target_Data)
## Features  Samples 
##    10682      129
# retain only detected genes of interest
goi <- goi[goi %in% rownames(target_Data)]

We will now normalize the GeoMx data for downstream visualizations and differential expression. The two common methods for normalization of DSP-NGS RNA data are i) quartile 3 (Q3) or ii) background normalization.

Both of these normalization methods estimate a normalization factor per segment to bring the segment data distributions together. More advanced methods for normalization and modeling are under active development. However, for most studies, these methods are sufficient for understanding differences between biological classes of segments and samples.

Q3 normalization is typically the preferred normalization strategy for most DSP-NGS RNA studies. Given the low negative probe counts in this particular dataset as shown during Segment QC, we would further avoid background normalization as it may be less stable.

Before normalization, we will explore the relationship between the upper quartile (Q3) of the counts in each segment with the geometric mean of the negative control probes in the data. Ideally, there should be a separation between these two values to ensure we have stable measure of Q3 signal. If you do not see sufficient separation between these values, you may consider more aggressive filtering of low signal segments/genes.

library(reshape2)  # for melt
library(cowplot)   # for plot_grid

# Graph Q3 value vs negGeoMean of Negatives
ann_of_interest <- "region"
Stat_data <- 
  data.frame(row.names = colnames(exprs(target_Data)),
             Segment = colnames(exprs(target_Data)),
             Annotation = pData(target_Data)[, ann_of_interest],
             Q3 = unlist(apply(exprs(target_Data), 2,
                               quantile, 0.75, na.rm = TRUE)),
             NegProbe = exprs(target_Data)[neg_probes, ])
Stat_data_m <- melt(Stat_data, measure.vars = c("Q3", "NegProbe"),
                    variable.name = "Statistic", value.name = "Value")

plt1 <- ggplot(Stat_data_m,
               aes(x = Value, fill = Statistic)) +
  geom_histogram(bins = 40) + theme_bw() +
  scale_x_continuous(trans = "log2") +
  facet_wrap(~Annotation, nrow = 1) + 
  scale_fill_brewer(palette = 3, type = "qual") +
  labs(x = "Counts", y = "Segments, #")

plt2 <- ggplot(Stat_data,
               aes(x = NegProbe, y = Q3, color = Annotation)) +
  geom_abline(intercept = 0, slope = 1, lty = "dashed", color = "darkgray") +
  geom_point() + guides(color = "none") + theme_bw() +
  scale_x_continuous(trans = "log2") + 
  scale_y_continuous(trans = "log2") +
  theme(aspect.ratio = 1) +
  labs(x = "Negative Probe GeoMean, Counts", y = "Q3 Value, Counts")

plt3 <- ggplot(Stat_data,
               aes(x = NegProbe, y = Q3 / NegProbe, color = Annotation)) +
  geom_hline(yintercept = 1, lty = "dashed", color = "darkgray") +
  geom_point() + theme_bw() +
  scale_x_continuous(trans = "log2") + 
  scale_y_continuous(trans = "log2") +
  theme(aspect.ratio = 1) +
  labs(x = "Negative Probe GeoMean, Counts", y = "Q3/NegProbe Value, Counts")

btm_row <- plot_grid(plt2, plt3, nrow = 1, labels = c("B", ""),
                     rel_widths = c(0.43,0.57))
plot_grid(plt1, btm_row, ncol = 1, labels = c("A", ""))

As expected, we see separation of the Q3 and negative probe counts at both the distribution (A) and per segment (B) levels. For additional conceptual guidance, please refer to our Data Analysis White Paper for DSP-NGS Assays.

Next, we normalize our data. We will use Q3 normalized data moving forward. We use the normalize function from NanoStringNCTools to create normalization factors reflecting each data type. Upper quartile (Q3) normalization is performed using norm_method = “quant” setting the desiredQuantile flag to 0.75. Other quantiles could be specified by changing that value. We save the normalized data to a specific slot using toELT = “q_norm”. Similarly background normalization is performed by setting norm_method = “neg” and toElt = “neg_norm”.

# Q3 norm (75th percentile) for WTA/CTA  with or without custom spike-ins
target_Data <- normalize(target_Data ,
                             norm_method = "quant", 
                             desiredQuantile = .75,
                             toElt = "q_norm")

# Background normalization for WTA/CTA without custom spike-in
target_Data <- normalize(target_Data ,
                             norm_method = "neg", 
                             fromElt = "exprs",
                             toElt = "neg_norm")

Differential Expression

We would like to compare diseased versus healthy livers. Because we are comparing disease status, which is specific to the entire kidney, we will use the LMM model without a random slope. Disease (testClass) is our test variable. Like our previous LMM example, we control for tissue subsampling with slide name as the intercept.

# convert test variables to factors
pData(target_Data)$testClass <-
    factor(pData(target_Data)$class, c("I/R", "sham"))
pData(target_Data)[["slide"]] <- 
  factor(pData(target_Data)[["slide name"]])
assayDataElement(object = target_Data, elt = "log_q") <-
  assayDataApply(target_Data, 2, FUN = log, base = 2, elt = "q_norm")

# run LMM:
# formula follows conventions defined by the lme4 package
results2 <- c()
for(region in c("zone 1", "zone 2", "zone 3")) {
    ind <- pData(target_Data)$region == region
    mixedOutmc <-
        mixedModelDE(target_Data[, ind],
                     elt = "log_q",
                     modelFormula = ~ testClass + (1 | slide),
                     groupVar = "testClass",
                     nCores = parallel::detectCores(),
                     multiCore = FALSE)
    
    # format results2 as data.frame
    r_test <- do.call(rbind, mixedOutmc["lsmeans", ])
    tests <- rownames(r_test)
    r_test <- as.data.frame(r_test)
    r_test$Contrast <- tests
    
    # use lapply in case you have multiple levels of your test factor to
    # correctly associate gene name with it's row in the results2 table
    r_test$Gene <- 
        unlist(lapply(colnames(mixedOutmc),
                      rep, nrow(mixedOutmc["lsmeans", ][[1]])))
    r_test$Subset <- region
    r_test$FDR <- p.adjust(r_test$`Pr(>|t|)`, method = "fdr")
    r_test <- r_test[, c("Gene", "Subset", "Contrast", "Estimate", 
                         "Pr(>|t|)", "FDR")]
    results2 <- rbind(results2, r_test)
}

We save the LMM outputs into a table (results2) containing three of the key features for differential expression: the log2 fold change value (Estimate), P-value (Pr(>|t|)), and false-discovery adjusted P-values (FDR). The contrast column is used to interpret the log2 fold change value as it specifies which levels are compared (e.g. positive fold change values when comparing glomerulus - tubule indicates an enrichment in the glomerulus; negative indicates enrichment in tubules).

We can display these results by subsetting the results table.

goi <- c("Epas1", "Hif1a", "Hnrnpa1", "Srsf3", "Srsf6", "Ptbp1", "Elavl1")
kable(subset(results2, Gene %in% goi & Subset == "zone 3"), digits = 3,
      caption = "DE results for Genes of Interest (zone 3)",
      align = "lc", row.names = FALSE)
DE results for Genes of Interest (zone 3)
Gene Subset Contrast Estimate Pr(>|t|) FDR
Elavl1 zone 3 I/R - sham 0.586 0.000 0.000
Epas1 zone 3 I/R - sham 0.203 0.104 0.162
Hif1a zone 3 I/R - sham 1.080 0.000 0.000
Hnrnpa1 zone 3 I/R - sham 0.314 0.001 0.003
Ptbp1 zone 3 I/R - sham 1.202 0.000 0.000
Srsf3 zone 3 I/R - sham 0.874 0.000 0.000
Srsf6 zone 3 I/R - sham 0.432 0.000 0.000
kable(subset(results2, Gene %in% goi & Subset == "zone 2"), digits = 3,
      caption = "DE results for Genes of Interest (zone 2)",
      align = "lc", row.names = FALSE)
DE results for Genes of Interest (zone 2)
Gene Subset Contrast Estimate Pr(>|t|) FDR
Elavl1 zone 2 I/R - sham 0.832 0.000 0.000
Epas1 zone 2 I/R - sham 0.440 0.000 0.000
Hif1a zone 2 I/R - sham 1.078 0.000 0.000
Hnrnpa1 zone 2 I/R - sham 0.193 0.021 0.037
Ptbp1 zone 2 I/R - sham 1.267 0.000 0.000
Srsf3 zone 2 I/R - sham 0.670 0.000 0.000
Srsf6 zone 2 I/R - sham 0.452 0.000 0.000
kable(subset(results2, Gene %in% goi & Subset == "zone 1"), digits = 3,
      caption = "DE results for Genes of Interest (zone 1)",
      align = "lc", row.names = FALSE)
DE results for Genes of Interest (zone 1)
Gene Subset Contrast Estimate Pr(>|t|) FDR
Elavl1 zone 1 I/R - sham 0.911 0.000 0.00
Epas1 zone 1 I/R - sham 0.864 0.000 0.00
Hif1a zone 1 I/R - sham 0.752 0.000 0.00
Hnrnpa1 zone 1 I/R - sham 0.085 0.317 0.38
Ptbp1 zone 1 I/R - sham 0.950 0.000 0.00
Srsf3 zone 1 I/R - sham 0.725 0.000 0.00
Srsf6 zone 1 I/R - sham 0.697 0.000 0.00

A canonical visualization for interpreting differential gene expression results is the volcano plot. We will look at the LMM results from our sham vs. I/R comparison, highlighting the genes we’re interested in.

library(ggrepel) 
# Categorize results2 based on P-value & FDR for plotting
results2$Color <- "NS or FC < 0.5"
results2$Color[results2$`Pr(>|t|)` < 0.05] <- "P < 0.05"
results2$Color[results2$FDR < 0.05] <- "FDR < 0.05"
results2$Color[results2$FDR < 0.001] <- "FDR < 0.001"
results2$Color[abs(results2$Estimate) < 0.5] <- "NS or FC < 0.5"
results2$Color <- factor(results2$Color,
                        levels = c("NS or FC < 0.5", "P < 0.05",
                                   "FDR < 0.05", "FDR < 0.001"))

# pick top genes for either side of volcano to label
# order genes for convenience:
results2$invert_P <- (-log10(results2$`Pr(>|t|)`)) * sign(results2$Estimate)
top_g <- c()
for(cond in c("zone 1", "zone 2", "zone 3")) {
    ind <- results2$Subset == cond
    top_g <- c(top_g,
               results2[ind, 'Gene'][
                   order(results2[ind, 'invert_P'], decreasing = TRUE)[1:15]],
               results2[ind, 'Gene'][
                   order(results2[ind, 'invert_P'], decreasing = FALSE)[1:15]])
}
top_g <- unique(top_g)
results2 <- results2[, -1*ncol(results2)] # remove invert_P from matrix

# Graph results2
ggplot(results2,
       aes(x = Estimate, y = -log10(`Pr(>|t|)`),
           color = Color, label = Gene)) +
    geom_vline(xintercept = c(0.5, -0.5), lty = "dashed") +
    geom_hline(yintercept = -log10(0.05), lty = "dashed") +
    geom_point(size=0.2) +
    labs(x = "Enriched in sham <- log2(FC) -> Enriched in I/R",
         y = "Significance, -log10(P)",
         color = "Significance") +
    scale_color_manual(values = c(`FDR < 0.001` = "dodgerblue",
                                  `FDR < 0.05` = "lightblue",
                                  `P < 0.05` = "orange2",
                                  `NS or FC < 0.5` = "gray"),
                       guide = guide_legend(override.aes = list(size = 4))) +
    scale_y_continuous(limits = c(0.001, 25), expand = expansion(mult = c(0,0.05))) +
    geom_text_repel(data = subset(results2, Gene %in% c("Epas1", "Hif1a", "Hnrnpa1", "Srsf3", "Srsf6", "Ptbp1", "Elavl1")),
                    size = 4, point.padding = 0.15, color = "black",nudge_y = 4,nudge_x = 6,
                    min.segment.length = .1, box.padding = .2,
                    max.overlaps = 50) +
  # geom_text_repel(data = subset(results2, Gene %in% c("Ceacam1")),
  #                   size = 4, point.padding = 0.15, color = "black",nudge_y = 4,nudge_x = -4,
  #                   min.segment.length = .1, box.padding = .2,
  #                   max.overlaps = 50) +
    theme_bw(base_size = 16) +
    theme(legend.position = "bottom") +
    facet_wrap(~Subset, scales = "free_y")

In addition to generating individual gene box plots or volcano plots, we can again create a heatmap from our data.

Plot heatmap:

pheatmap(t(goiData[,-(9:8)]),
         scale = "row", 
         show_rownames = TRUE, show_colnames = FALSE,
         border_color = "grey", cluster_cols = F,
         treeheight_row = 0,
         treeheight_col = 0,
         breaks = seq(-2, 3, 0.05), 
         # color = colorRampPalette(c("purple3", "black", "yellow2"))(120),
         cellwidth=6, cellheight=10,
         annotation_col = goiData[,(9:8)])

Cell-type deconvolution

First, we need to ormalize gene expression data using different normalization methods:

target_Data <- normalize(
  target_Data,
  fromElt = "exprs",
  toElt = "exprs_norm"
)

The spatialdecon function takes 3 arguments of expression data:

  1. The normalized data.
  2. A matrix of expected background for all data points in the normalized data matrix.
  3. Optionally, either a matrix of per-data-point weights, or the raw data, which is used to derive weights (low counts are less statistically stable, and this allows spatialdecon to down-weight them.)

We estimate each data point’s expected background from the negative control probes from its corresponding observation: This study has two probesets, and the genes from each probeset will have distinct background values. The function “derive_GeoMx_background” conveniently estimates background of all data points, accounting for which probeset each gene belongs to:

We download a matrix of cell profiles derived from scRNA-seq of a mouse liver

Plot the loading for each cell type from the scRNA-seq liver data

Combining cell types

When two cell types are too similar, the estimation of their abundances becomes unstable. However, their sum can still be estimated easily. The function “collapseCellTypes” takes a deconvolution results object and collapses any closely-related cell types you tell it to:

Replot the cell composition histogram:

Reverse deconvolution

Once cell type abundance has been estimated, we can flip the deconvolution around, modelling the expression data as a function of cell abundances, and thereby deriving:

  1. Estimated expression of each gene in each cell type. (Including for genes not present in your cell profile matrix)
  2. Fitted expression values for each gene based on cell mixing.
  3. Residuals of each gene: how does their expression compare to what cell mixing would predict?
  4. Two metrics of how well genes are predicted by/ redundant with cell mixing: correlation between observed and fitted expression, and residual SD.

The function “reversedecon” runs this model.